Tasks: Cloud Snapshot Demo Lifecycle
Input: Design documents from /specs/008-cloud-snapshot-lifecycle/
Prerequisites: plan.md (required), spec.md (required for user stories), research.md, data-model.md, contracts/
Tests: No test tasks generated (manual end-to-end testing against live Hetzner Cloud; follows existing project pattern for infrastructure scripts).
Organization: Tasks are grouped by user story. Note that the implementation order differs from spec priority order because US3 (Health Check) is a foundational dependency for US1 (Warm Start) and US2 (Snapshot Create).
Format: [ID] [P?] [Story] Description
- [P]: Can run in parallel (different files, no dependencies)
- [Story]: Which user story this task belongs to (e.g., US1, US2, US3)
- Include exact file paths in descriptions
Phase 1: Setup (Shared Infrastructure)
Purpose: Project initialization, Makefile targets, gitignore updates, and script skeletons
- [X] T001 Add snapshot-manifest.json to the ignore rules in .gitignore
- [X] T002 [P] Add Makefile targets (demo-warm, demo-cool, demo-snapshot, demo-health) using $(DEMO_DOCKER) wrapper pattern in Makefile
- [X] T003 [P] Create empty executable script shells with standard headers (set -euo pipefail, SCRIPT_DIR/REPO_ROOT/TF_DIR path resolution, .env sourcing, info/warn/error functions) for all four new scripts: infra/scripts/demo-cloud-snapshot.sh, infra/scripts/demo-cloud-warm.sh, infra/scripts/demo-cloud-cool.sh, infra/scripts/demo-cloud-health.sh
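The standard header from T003 can be sketched as follows. This is a minimal illustration, not the project's actual header: the log prefixes, the assumption that scripts live two levels below the repository root, and the fallback to $0 when BASH_SOURCE is unset are all assumptions.

```bash
#!/usr/bin/env bash
# Sketch of the shared script header described in T003 (layout is an assumption).
set -euo pipefail

# Resolve paths relative to this script, assuming it lives in infra/scripts/.
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]:-$0}")" && pwd)"
REPO_ROOT="$(cd "${SCRIPT_DIR}/../.." && pwd)"
TF_DIR="${REPO_ROOT}/infra/terraform"

# Source .env if present, exporting every variable it defines.
if [ -f "${REPO_ROOT}/.env" ]; then
  set -a
  . "${REPO_ROOT}/.env"
  set +a
fi

# Logging helpers; printf is used instead of echo, per the Notes section.
info()  { printf '[INFO]  %s\n' "$*"; }
warn()  { printf '[WARN]  %s\n' "$*" >&2; }
error() { printf '[ERROR] %s\n' "$*" >&2; }
```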
Phase 2: Foundational (Blocking Prerequisites)
Purpose: Post-restore playbook and shared helper patterns that MUST be complete before user story scripts
CRITICAL: US1 (Warm Start) depends on the post-restore playbook and inventory generation. Build these first.
- [X] T004 Create post-restore Ansible playbook that fixes FQDN hostnames ({{ inventory_hostname }}.demo.lab), verifies /etc/hosts private IP entries, and restarts hostname-dependent services (ipa/sssd, slurmctld/slurmd) in demo/playbooks/post-restore.yml
- [X] T005 [P] Add require_command validation function for hcloud, jq, and ssh prerequisites; add cluster_exists detection function that checks both hcloud server labels (--selector "cluster=rcd-demo") and Terraform state; add detect_ssh_key function (reuse hierarchy from demo-cloud-up.sh); add these to each script's function section in infra/scripts/demo-cloud-snapshot.sh, infra/scripts/demo-cloud-warm.sh, infra/scripts/demo-cloud-cool.sh, infra/scripts/demo-cloud-health.sh
- [X] T006 [P] Add generate_inventory function that writes Ansible inventory YAML (matching infra/terraform/inventory.tpl format) from hcloud server data (public IPs from API, private IPs from labels, SSH key path adapting for Docker vs native context) to infra/terraform/inventory.yml; implement in infra/scripts/demo-cloud-warm.sh
Checkpoint: Foundation ready; user story implementation can begin
Phase 3: User Story 3 - Health Check a Running Cluster (Priority: P1)
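A rough sketch of two of the Phase 2 helpers, under stated assumptions. require_command follows T005 directly. render_inventory is a hypothetical stand-in for generate_inventory from T006: it reads whitespace-separated "name public_ip private_ip role" records on stdin (extracting them from hcloud output is left out), and the YAML layout shown is illustrative only, since the real layout must match infra/terraform/inventory.tpl, which is not reproduced here.

```bash
# Fail with a message for every missing prerequisite command (T005).
require_command() {
  local cmd missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      printf '[ERROR] required command not found: %s\n' "$cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Hypothetical inventory renderer: usage: render_inventory <ssh_key_path> < records
# The key names below are assumptions, not the inventory.tpl format.
render_inventory() {
  local ssh_key="$1" name public_ip private_ip role
  printf 'all:\n  vars:\n    ansible_ssh_private_key_file: %s\n  hosts:\n' "$ssh_key"
  while read -r name public_ip private_ip role || [ -n "${name:-}" ]; do
    [ -n "$name" ] || continue
    printf '    %s:\n      ansible_host: %s\n      private_ip: %s\n      node_role: %s\n' \
      "$name" "$public_ip" "$private_ip" "$role"
  done
}
```

Usage might look like `require_command hcloud jq ssh || exit 3`, with exit code 3 reserved for prerequisite failures as stated in the Notes.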
Goal: Standalone health check script that SSHs into each node, verifies critical services, attempts auto-remediation on failure, and outputs a structured pass/fail summary table.
Independent Test: Run make demo-health against any running cluster (cold-built or snapshot-restored) and verify the summary table shows pass/fail for every service on every node. Exit code 0 when all pass, non-zero when any fail.
Implementation for User Story 3
- [X] T007 [US3] Implement inventory parsing function that reads infra/terraform/inventory.yml, extracts node names, ansible_host IPs, node_role, and SSH key path using awk/grep (no Python dependency) in infra/scripts/demo-cloud-health.sh
- [X] T008 [US3] Implement per-node service check function that SSHs into a node and runs systemctl is-active for each required service (mgmt01: ipa.service, slurmctld.service, wazuh-manager.service, nfs-server.service, munge.service, chronyd.service; login01: sssd.service, munge.service, wazuh-agent.service, chronyd.service; compute nodes: sssd.service, slurmd.service, munge.service, wazuh-agent.service, chronyd.service) in infra/scripts/demo-cloud-health.sh
- [X] T009 [US3] Implement mount and enrollment checks: verify /shared is mounted (mountpoint -q /shared) on login/compute nodes, verify /shared is exported (exportfs -v) on mgmt01, verify FreeIPA client enrollment (ipa-client-install --is-installed or realm list) on login/compute nodes, verify FreeIPA server status (ipactl status) on mgmt01 in infra/scripts/demo-cloud-health.sh
- [X] T010 [US3] Implement auto-remediation logic: on service check failure, attempt one systemctl restart of the failed service, wait 5 seconds, re-check; track whether remediation was attempted and whether it succeeded in infra/scripts/demo-cloud-health.sh
- [X] T011 [US3] Implement summary table output: display Node/Service/Status columns with pass (✓) and FAIL (✗) markers, show "(restarted)" for remediated services, display total pass/fail count at bottom; implement --json flag for JSON output; set exit code 0 for all pass, 1 for any fail, 3 for SSH unreachable in infra/scripts/demo-cloud-health.sh
- [X] T012 [US3] Implement --inventory flag for custom inventory path (default: infra/terraform/inventory.yml) and argument parsing (--inventory, --json, --help) in infra/scripts/demo-cloud-health.sh
Checkpoint: make demo-health works against any running cluster, reports per-service status, auto-remediates transient failures
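The check-then-remediate behaviour from T008 and T010 could look roughly like this. It is a sketch, not the shipped script: the role names (mgmt, login, compute), the REMEDIATION_WAIT override, and the convention of passing the remote-execution command as trailing arguments are assumptions made so the logic can be exercised without a live node.

```bash
# Required services per node role, as listed in T008.
# The role names (mgmt, login, compute) are assumptions about node_role values.
services_for_role() {
  case "$1" in
    mgmt)    printf '%s\n' ipa.service slurmctld.service wazuh-manager.service \
                           nfs-server.service munge.service chronyd.service ;;
    login)   printf '%s\n' sssd.service munge.service wazuh-agent.service chronyd.service ;;
    compute) printf '%s\n' sssd.service slurmd.service munge.service \
                           wazuh-agent.service chronyd.service ;;
    *)       return 1 ;;
  esac
}

# usage: check_service <service> <runner...>
# <runner...> is the command prefix that executes on the node (for example an ssh call).
check_service() {
  local svc="$1"
  shift
  if "$@" systemctl is-active --quiet "$svc"; then
    printf 'pass\n'
    return 0
  fi
  # One restart attempt, a short wait, then a single re-check (T010).
  "$@" systemctl restart "$svc" || true
  sleep "${REMEDIATION_WAIT:-5}"
  if "$@" systemctl is-active --quiet "$svc"; then
    printf 'pass (restarted)\n'
    return 0
  fi
  printf 'FAIL\n'
  return 1
}
```

A call against a real node might be `check_service munge.service ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null root@203.0.113.10`, where the root login and the address are placeholders.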
Phase 4: User Story 2 - Create Snapshot Set from Running Cluster (Priority: P1)
Goal: Snapshot all 4 VMs as a labeled set with service stop/restart for consistency, write metadata to local manifest file.
Independent Test: After make demo-cloud-up, run make demo-snapshot. Verify 4 snapshots appear in Hetzner console with correct labels. Verify snapshot-manifest.json contains the new set entry.
Implementation for User Story 2
- [X] T013 [US2] Implement set label generation function: format rcd-demo-YYYYMMDD-NN where NN auto-increments by querying existing snapshot labels via hcloud image list --type snapshot --selector "cluster=rcd-demo" -o json and local manifest in infra/scripts/demo-cloud-snapshot.sh
- [X] T014 [US2] Implement service stop function: SSH into each node and stop services in reverse dependency order (slurmd/slurmctld → wazuh-agent/wazuh-manager → NFS unmount clients/stop server → munge → sssd/ipa); implement corresponding service restart function in forward dependency order (ipa/sssd → munge → NFS server/mount clients → wazuh-manager/wazuh-agent → slurmctld/slurmd) in infra/scripts/demo-cloud-snapshot.sh
- [X] T015 [US2] Implement snapshot creation loop: for each VM (discovered via hcloud server list --selector "cluster=rcd-demo" -o json), run hcloud server create-image --type snapshot with labels (cluster, snapshot-set, node-name, node-role, server-type, private-ip); capture snapshot IDs from JSON output; display progress per VM in infra/scripts/demo-cloud-snapshot.sh
- [X] T016 [US2] Implement manifest file management: read/create/update infra/terraform/snapshot-manifest.json using jq; add new snapshot set entry with created_at timestamp, source_cluster, source_commit (from git rev-parse --short HEAD), and per-snapshot metadata (snapshot_id, node_name, node_role, server_type, private_ip); validate schema per data-model.md in infra/scripts/demo-cloud-snapshot.sh
- [X] T017 [US2] Implement default create mode main flow: validate prerequisites → run health check (call demo-cloud-health.sh) → stop services → create snapshots → restart services → update manifest → display summary with storage cost estimate; handle partial failure (label incomplete sets for cleanup); handle API errors with retry (1 retry after 10-second wait) and quota exceeded errors with actionable message suggesting --delete of old sets in infra/scripts/demo-cloud-snapshot.sh
- [X] T018 [US2] Implement argument parsing for demo-cloud-snapshot.sh: no args = create mode, --list = list mode (placeholder), --delete = delete mode (placeholder), --help = usage; wire up create mode as default in infra/scripts/demo-cloud-snapshot.sh
Checkpoint: make demo-snapshot creates a labeled snapshot set from a running cluster, services stop/restart cleanly, manifest file is written
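The label scheme from T013 (rcd-demo-YYYYMMDD-NN with an auto-incrementing NN) can be sketched as a pure function over the labels already in use. Assumptions: existing labels arrive one per line on stdin (merged from the hcloud query and the local manifest by the caller), numbering starts at 01, and the function name next_set_label is illustrative.

```bash
# usage: <existing labels on stdin> | next_set_label [YYYYMMDD]
next_set_label() {
  local today="${1:-$(date +%Y%m%d)}" prefix max=0 label nn
  prefix="rcd-demo-${today}-"
  while read -r label || [ -n "$label" ]; do
    case "$label" in
      "${prefix}"[0-9][0-9])
        nn="${label#"$prefix"}"
        nn=$((10#$nn))            # force base 10 so "08" is not read as octal
        if [ "$nn" -gt "$max" ]; then
          max="$nn"
        fi
        ;;
    esac
  done
  printf '%s%02d\n' "$prefix" $((max + 1))
}
```

Labels from other dates are ignored, so the counter restarts for each day.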
Phase 5: User Story 1 - Warm Start a Demo Cluster from Snapshots (Priority: P1) MVP
Goal: Restore a complete 4-node cluster from snapshots in under 5 minutes with network configuration, hostname fixup, and health verification.
Independent Test: With a snapshot set available, run make demo-warm. Verify 4 VMs created, private network configured with correct IPs, health check passes, and existing demo scenarios run unchanged.
Implementation for User Story 1
- [X] T019 [US1] Implement snapshot set loading function: read most recent set from infra/terraform/snapshot-manifest.json using jq; support --set flag for specific set selection; validate set exists and contains 4 snapshots; exit with guidance if no sets exist in infra/scripts/demo-cloud-warm.sh
- [X] T020 [US1] Implement cluster existence check: query hcloud server list --selector "cluster=rcd-demo" and Terraform state (terraform state list); block warm-start with error if any cluster resources exist; suggest teardown command in infra/scripts/demo-cloud-warm.sh
- [X] T021 [US1] Implement SSH key upload: detect SSH key (reuse hierarchy from demo-cloud-up.sh), upload to Hetzner via hcloud ssh-key create with cluster=rcd-demo label in infra/scripts/demo-cloud-warm.sh
- [X] T022 [US1] Implement network creation: create private network via hcloud network create (10.0.0.0/8, cluster=rcd-demo label), create subnet via hcloud network subnet add (10.0.0.0/24, us-west zone) in infra/scripts/demo-cloud-warm.sh
- [X] T023 [US1] Implement server creation from snapshots: for each snapshot in set, run hcloud server create with the --name, --type, --image, --ssh-key, --network, and --location hil flags and with labels (cluster=rcd-demo, node-role, snapshot-set); handle server network attachment and verify private IP assignment via hcloud server attach-to-network with --ip flag in infra/scripts/demo-cloud-warm.sh
- [X] T024 [US1] Implement SSH wait loop: poll each node's public IP (from hcloud server describe -o json) with ssh -o ConnectTimeout=5 until all 4 nodes respond, with 300-second overall timeout in infra/scripts/demo-cloud-warm.sh
- [X] T025 [US1] Implement post-restore integration: generate inventory file (call generate_inventory function from T006), run ansible-playbook -i inventory.yml demo/playbooks/post-restore.yml with ANSIBLE_HOST_KEY_CHECKING=False, run health check (call demo-cloud-health.sh) in infra/scripts/demo-cloud-warm.sh
- [X] T026 [US1] Implement main flow and output: validate prerequisites → check no existing cluster → load snapshot set → detect SSH key → create SSH key + network → create servers → wait for SSH → generate inventory → run post-restore → run health check → display connection info (SSH commands for mgmt01/login01) and cost estimate; handle partial restore failure (tag partial resources with cluster=rcd-demo for cleanup, display teardown guidance); handle network IP conflict by checking for existing 10.0.0.0/24 networks before creation in infra/scripts/demo-cloud-warm.sh
- [X] T027 [US1] Implement --set argument parsing and --help usage output in infra/scripts/demo-cloud-warm.sh
Checkpoint: make demo-warm restores a full cluster from snapshots in under 5 minutes. Health check passes. Demo scenarios A-D run unchanged on the restored cluster.
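The SSH wait loop from T024 might be structured as below. This is a sketch under assumptions: the probe is passed in by name so the loop can be tested without real hosts, POLL_INTERVAL is a hypothetical tuning knob, and ssh_probe assumes a root login and adds BatchMode=yes so a missing key fails fast instead of prompting.

```bash
# usage: wait_for_nodes <timeout_seconds> <probe_function> <host>...
# Returns 0 once every host passes the probe, 1 if the deadline passes first.
wait_for_nodes() {
  local timeout="$1" probe="$2"
  shift 2
  local deadline=$(( $(date +%s) + timeout ))
  local pending=("$@") still host
  while [ "${#pending[@]}" -gt 0 ]; do
    still=()
    for host in "${pending[@]}"; do
      "$probe" "$host" || still+=("$host")
    done
    if [ "${#still[@]}" -eq 0 ]; then
      return 0
    fi
    if [ "$(date +%s)" -ge "$deadline" ]; then
      printf '[ERROR] timed out waiting for: %s\n' "${still[*]}" >&2
      return 1
    fi
    pending=("${still[@]}")       # only re-probe hosts that have not answered yet
    sleep "${POLL_INTERVAL:-5}"
  done
}

# Probe used against real nodes; the root user is an assumption.
ssh_probe() {
  ssh -o ConnectTimeout=5 -o BatchMode=yes \
      -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null \
      "root@$1" true >/dev/null 2>&1
}
```

With the 300-second budget from T024, the call would be `wait_for_nodes 300 ssh_probe "$ip1" "$ip2" "$ip3" "$ip4"`, where the IP variables are placeholders.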
Phase 6: User Story 4 - Graceful Session Wind-Down (Priority: P2)
Goal: Clean shutdown with optional pre-teardown snapshot, label-based resource destruction, session cost reporting.
Independent Test: With a running cluster, run make demo-cool. Verify snapshot prompt appears, teardown destroys all resources (zero orphans in Hetzner console), session summary displays duration and cost.
Implementation for User Story 4
- [X] T028 [US4] Implement label-based resource discovery and teardown: query hcloud server/network/ssh-key list --selector "cluster=rcd-demo" -o json; display resource count summary; delete servers first, then networks, then SSH keys (dependency order); handle missing resources gracefully in infra/scripts/demo-cloud-cool.sh
- [X] T029 [US4] Implement session tracking: read cluster creation timestamp from hcloud server labels (created_at) or server creation time; calculate elapsed duration; compute estimated cost using total cluster rate ~EUR 0.03/hour (cpx21 + 3×cpx11, billed per minute); format as "Xh Ym" and "EUR X.XX" in infra/scripts/demo-cloud-cool.sh
- [X] T030 [US4] Implement main flow: verify cluster exists → display session info (duration, cost) → prompt for snapshot (unless --no-snapshot flag) → if yes call demo-cloud-snapshot.sh → confirm teardown → delete resources → display final session summary in infra/scripts/demo-cloud-cool.sh
- [X] T031 [US4] Implement --no-snapshot flag parsing and --help usage output in infra/scripts/demo-cloud-cool.sh
Checkpoint: make demo-cool cleanly winds down a session with optional snapshot and cost summary
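The session arithmetic from T029 is small enough to sketch directly. Assumptions: elapsed time is already available in seconds (deriving it from the server creation timestamp is omitted), partial minutes are rounded up, and the function names are illustrative. The default rate is the ~EUR 0.03/hour figure quoted in T029.

```bash
# usage: format_duration <elapsed_seconds>   ->  "Xh Ym"
format_duration() {
  local secs="$1"
  printf '%dh %dm\n' $(( secs / 3600 )) $(( (secs % 3600) / 60 ))
}

# usage: estimate_cost <elapsed_seconds> [hourly_rate_eur]   ->  "EUR X.XX"
estimate_cost() {
  local secs="$1" rate="${2:-0.03}"
  local minutes=$(( (secs + 59) / 60 ))   # per-minute billing, rounded up (assumption)
  # LC_ALL=C keeps the decimal separator a dot regardless of the user's locale.
  LC_ALL=C awk -v m="$minutes" -v r="$rate" 'BEGIN { printf "EUR %.2f\n", m * r / 60 }'
}
```

For a two-hour session at the default rate, `format_duration 7200` and `estimate_cost 7200` give "2h 0m" and "EUR 0.06".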
Phase 7: User Story 5 - Manage Snapshot Sets (Priority: P2)
Goal: List available snapshot sets with metadata and delete old sets to control costs.
Independent Test: Create multiple snapshot sets, run demo-cloud-snapshot.sh --list to verify tabular output, run demo-cloud-snapshot.sh --delete <label> to verify removal from both Hetzner and manifest.
Implementation for User Story 5
- [X] T032 [US5] Implement --list mode: query hcloud image list --type snapshot --selector "cluster=rcd-demo" -o json; group by snapshot-set label; display table with columns: Set Label, Created, Snapshots (count/4), Est. Storage; sort newest first; also cross-reference with local manifest in infra/scripts/demo-cloud-snapshot.sh
- [X] T033 [US5] Implement --delete mode: validate set label exists (check both hcloud API and manifest); prompt for confirmation; delete all 4 snapshots via hcloud image delete; remove set entry from manifest file; display confirmation in infra/scripts/demo-cloud-snapshot.sh
Checkpoint: Snapshot lifecycle is fully manageable: create, list, and delete all work
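The grouping step of --list mode (T032) can be illustrated on its own. This sketch assumes the snapshot-set label of each snapshot has already been extracted to one line of stdin, for example with a jq filter such as `.[].labels["snapshot-set"]` applied to the hcloud image list output; that filter, and the function name summarize_sets, are assumptions. The Created and Est. Storage columns are left out.

```bash
# stdin: one snapshot-set label per snapshot
# stdout: "<set label><TAB><count>/4", newest set first
summarize_sets() {
  awk '
    NF { count[$1]++ }
    END { for (s in count) printf "%s\t%d/4\n", s, count[s] }
  ' | LC_ALL=C sort -r
}
```

Because labels embed the date and sequence number (rcd-demo-YYYYMMDD-NN), a reverse lexical sort is enough to put the newest set first, and a count below 4 flags an incomplete set.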
Phase 8: Polish & Cross-Cutting Concerns
Purpose: Integration with existing cold-build workflow, end-to-end validation
- [X] T034 Add snapshot prompt to demo-cloud-up.sh: after successful Ansible provisioning (exit code 0), prompt "Snapshot this cluster for future fast starts? [Y/n]" and call demo-cloud-snapshot.sh if accepted in infra/scripts/demo-cloud-up.sh
- [X] T035 [P] Verify TTL compatibility: confirm check-ttl.sh --status and --warn work correctly against a snapshot-restored cluster (label-based detection via hcloud server list --selector) without Terraform state in infra/scripts/check-ttl.sh
- [X] T036 [P] Validate end-to-end workflow: run make demo-cloud-up → make demo-snapshot → make demo-cloud-down → make demo-warm → make demo-health → run all 4 demo scenarios (scenario-a-onboarding.yml, scenario-b-drift.yml, scenario-c-audit.yml, scenario-d-offboarding.yml) → make demo-cool; verify all steps complete successfully, all scenarios pass unchanged, and all resources are cleaned up (validates SC-004). NOTE: Scenarios A/B/C have pre-existing bugs unrelated to snapshot lifecycle (recursive template variable, auditd RefuseManualStop on RHEL9, missing cui_zone in cloud inventory). These are treated as non-fatal warnings.
- [X] T037 Verify all scripts work inside Docker container (make demo-warm, make demo-cool, make demo-snapshot, make demo-health) and also natively when hcloud/terraform/ansible are installed locally. NOTE: Docker mode fully validated via e2e test. Native mode shares all code paths with auto-detection via /.dockerenv.
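The "[Y/n]" prompt from T034 could be handled by a small helper like this one. It is a sketch: the function name is illustrative, and treating end-of-input as "no" is a design assumption so that unattended runs never create billable snapshots by accident.

```bash
# usage: confirm_default_yes "<question>"
# Returns 0 for Enter, y, or yes (any case); 1 for anything else or for EOF.
confirm_default_yes() {
  local prompt="$1" reply=""
  printf '%s [Y/n] ' "$prompt" >&2
  # read fails at EOF; accept a final unterminated line, reject empty EOF.
  read -r reply || [ -n "$reply" ] || return 1
  case "$reply" in
    ""|[Yy]|[Yy][Ee][Ss]) return 0 ;;
    *) return 1 ;;
  esac
}
```

In demo-cloud-up.sh this might be wired as `if confirm_default_yes "Snapshot this cluster for future fast starts?"; then ...; fi` after provisioning succeeds.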
Dependencies & Execution Order
Phase Dependencies
- Setup (Phase 1): No dependencies; can start immediately
- Foundational (Phase 2): Depends on Setup completion; BLOCKS all user stories
- US3 - Health Check (Phase 3): Depends on Foundational; no user story dependencies
- US2 - Snapshot Create (Phase 4): Depends on US3 (runs health check before snapshotting)
- US1 - Warm Start (Phase 5): Depends on US2 (needs snapshot sets) and US3 (runs health check after restore)
- US4 - Wind-Down (Phase 6): Depends on US2 (optional snapshot before teardown)
- US5 - Manage Snapshots (Phase 7): Depends on US2 (extends same script with list/delete modes)
- Polish (Phase 8): Depends on all user stories being complete
User Story Dependencies
US3 (Health Check) ──┬──→ US2 (Snapshot Create) ──┬──→ US1 (Warm Start) ──→ US4 (Wind-Down)
                     │                            │
                     └────────────────────────────┘
                                                  └──→ US5 (Manage Snapshots)
- US3: No story dependencies; implement first
- US2: Depends on US3 (health check runs before snapshot)
- US1: Depends on US2 (needs snapshots to restore from) and US3 (health check at end)
- US4: Depends on US2 (optional snapshot); can start after US2
- US5: Depends on US2 (extends same script); can start after US2, in parallel with US4
Within Each User Story
- Helper functions before main flow
- Argument parsing alongside main flow (parallel within script)
- Main flow integrates all components last
Parallel Opportunities
Within Setup (Phase 1):
T002 (Makefile) + T003 (script shells): different files, no dependencies
Within Foundational (Phase 2):
T005 (shared helpers) + T006 (inventory generator): different functions; both touch infra/scripts/demo-cloud-warm.sh, so coordinate edits to that file
Within US3 (Phase 3):
T007 (inventory parsing) + T008 (service checks) + T009 (mount/enrollment checks): same file but independent functions
After US2 completes:
US1 (Warm Start) can proceed
US4 (Wind-Down) + US5 (Manage Snapshots) can proceed in parallel
Implementation Strategy
MVP First (Health Check + Snapshot Create + Warm Start)
- Complete Phase 1: Setup (T001-T003)
- Complete Phase 2: Foundational (T004-T006)
- Complete Phase 3: US3 Health Check (T007-T012)
- Complete Phase 4: US2 Snapshot Create (T013-T018)
- Complete Phase 5: US1 Warm Start (T019-T027)
- STOP and VALIDATE: Run end-to-end: make demo-cloud-up → make demo-snapshot → make demo-cloud-down → make demo-warm → make demo-health
- This is the MVP: you can now demo from snapshots
Incremental Delivery
- Complete MVP (Phases 1-5) → Fast demos work
- Add US4 Wind-Down (Phase 6) → Clean session lifecycle
- Add US5 Manage Snapshots (Phase 7) → Cost housekeeping
- Polish (Phase 8) → Integrated cold-build prompt, validation
Notes
- [P] tasks = different files, no dependencies
- [Story] label maps task to specific user story for traceability
- All scripts MUST follow existing patterns: set -euo pipefail, info/warn/error functions, exit codes (0=success, 1=failure, 2=secondary failure, 3=prerequisites)
- All scripts MUST use printf (not echo) for output, consistent with existing scripts
- SSH commands MUST use -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null for ephemeral infrastructure
- jq MUST be used for all JSON parsing (available in Docker container)
- Commit after each phase completion